The stock market has consistently proven to be a dependable place to invest and save for the future. There are many compelling reasons to invest in stocks: they help fight inflation, build wealth, and can offer tax benefits. Steady returns compounded over a long period can grow far beyond what seems possible at the outset. Thanks to the power of compounding, the earlier one starts investing, the larger the corpus one can accumulate for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise returns under any market condition. A diversified portfolio tends to yield higher returns at lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to value a stock, and repeating that analysis across a multitude of stocks to identify the right picks for an individual is tedious. Cluster analysis can identify groups of stocks that exhibit similar characteristics and groups that exhibit minimal correlation with one another. This helps investors analyze stocks across different market segments and guards against risks that could leave the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock prices and some financial indicators for a few companies listed on the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
#sns.set_theme(style='darkgrid')
from tabulate import tabulate
#format numeric data for easier readability
pd.set_option(
"display.float_format", lambda x: "%.2f" % x
) # to display numbers rounded off to 2 decimal places
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# Finding optimal no. of clusters
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
# mount from Google Drive
from google.colab import drive
drive.mount('/content/drive')
# loading the dataset(s)
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Projects/Project 7/stock_data.csv')
# copying the data to another variable to avoid any changes to original data
df = data.copy()
df.head()
# let's randomly select 10 rows from the dataset
df.sample(n=10)
df.tail(5)
df.shape
df.dtypes
df.info()
df[df.duplicated()].count()
df.duplicated().sum()
# let's verify if there are missing values in the data
data.isnull().sum().sort_values(ascending=False) # sort descending
### Checking for unique values of each column
print("Unique values in dataset: \n\n",df.nunique())
Data Overview Observations
df.describe(include = 'object')
# filtering object type columns
cat_columns = data.describe(include=["object"]).columns
cat_columns
for i in cat_columns:
    print(data[i].value_counts())
    print("*" * 40)
    print("\n")
    print(data[i].describe())
    print("*" * 40)
    print("\n")
Observation of Categorical Statistical Summary
# lets look at the numerical features.
df.describe().T
Some Observations to note from the statistical summary of the entire dataset:
Some EDA Explorations:
(Following Questions answered in Bivariate Analysis)
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=12)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="viridis",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count/percentage
    plt.show()  # show the plot
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(df, feature, figsize=(12, 7), kde=True, bins=None):
    """
    Boxplot and histogram combined
    df: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default True)
    bins: number of bins for histogram (default None, i.e., chosen automatically)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=df, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=df, x=feature, kde=kde, ax=ax_hist2, bins=bins
    )  # histogram; bins=None lets seaborn choose the binning automatically
    ax_hist2.axvline(
        df[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        df[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
df["GICS Sector"].value_counts(normalize=True)*100
# barplot visualization of stocks by sector
labeled_barplot(df, 'GICS Sector', n=5)
df["GICS Sub Industry"].value_counts(normalize=True)*100
# barplot visualization of sub industry for top 5
labeled_barplot(df, 'GICS Sub Industry', n=5)
The top 5 sub-industries with the highest count of stocks include
Note :
# filter the columns of the dataframe with numeric datatypes only
numeric_columns = df.select_dtypes(include=["int64", "float64"]).columns
for feature in numeric_columns:
    histogram_boxplot(df, feature, figsize=(12, 7), kde=True, bins=None)
Summary Observations
It can be noted that
Features : Current Price (Current Stock Price), Volatility, ROE (Return on Equity), Cash Ratio, Net Cash Flow, Net Income, Estimated Shares Outstanding and P/E Ratio exhibit a right skew, as their mean values are higher than their respective median values.
The presence of outliers on both ends of the boxplots such as for Price Change, Net Cash Flow, Net Income, Earnings Per Share and P/B Ratio indicates that there are both very large positive and very large negative changes in price within the dataset.
Although some of the variables contain outliers, the values may be realistic as we are dealing with stock prices.
However, k-means clustering is sensitive to outliers, which can have a substantial impact on its results. Hence, we will carry out outlier detection and treatment.
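The skew claim above can be checked numerically: a positive `skew()` value goes hand in hand with the mean sitting above the median. A minimal sketch on toy data (the values here are hypothetical stand-ins for the stock features):

```python
import pandas as pd

# Hypothetical numeric columns standing in for the stock features
toy = pd.DataFrame({
    "Net Income": [1, 2, 2, 3, 50],     # long right tail
    "Price Change": [-5, -1, 0, 1, 5],  # roughly symmetric
})

skews = toy.skew()  # positive value => right (positive) skew
print(skews.round(2))
print(toy.mean() > toy.median())  # right-skewed columns show True here
```

In the notebook itself, `df[numeric_columns].skew()` gives the same diagnostic per feature.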
# top 5 companies by EPS
top_5_companies = df.nlargest(5, 'Earnings Per Share')
top_5_companies = top_5_companies[['Security', 'GICS Sector','Earnings Per Share']]
table = top_5_companies.values.tolist()
headers = top_5_companies.columns.tolist()
print('Top 5 EPS Companies')
# using tabulate to create a nice table
print(tabulate(table, headers, tablefmt="fancy_grid"))
# Add a line break
print("\n")
lowest_eps_companies = df.nsmallest(3, 'Earnings Per Share')
lowest_eps_companies = lowest_eps_companies[['Security', 'GICS Sector','Earnings Per Share']]
table2 = lowest_eps_companies.values.tolist()
headers2 = lowest_eps_companies.columns.tolist()
print('Lowest 3 EPS Companies')
#using tabulate to create a nice table
print(tabulate(table2, headers2, tablefmt="fancy_grid"))
EPS is an important indicator of a company's profitability on a per-share basis and is widely used by investors in evaluating stock performance.
Which economic sector has seen the maximum price increase on average?
# Calculate the average price change for each economic sector
df.groupby('GICS Sector')['Price Change'].mean().sort_values(ascending=False)
plt.figure(figsize=(15,8))
sns.boxplot(data=df, x='GICS Sector', y='Price Change')
plt.xticks(rotation=90);
# Calculate the average price change for each economic sector
average_price_change = df.groupby('GICS Sector')['Price Change'].mean()
#Identify the economic sector with the maximum average price change
max_price_change_sector = average_price_change.idxmax()
print("The stocks from", max_price_change_sector, "sector have seen the maximum price increase on average.")
Observation and Answer
How are the different variables correlated with each other?
#lets create a correlation heat map for the numerical variables
# numeric_columns was defined earlier above, hence not required to repeat; shown below for reference only
# numeric_columns = df.select_dtypes(include=["int64", "float64"]).columns
plt.figure(figsize=(14, 7))
sns.heatmap(
df[numeric_columns].corr(),
annot=True,
vmin=-1,
vmax=1,
fmt=".2f",
cmap='Spectral'
)
plt.show()
Observation of correlation between the following variables:
"Net Income" has the highest positive correlation with both "Estimated Shares Outstanding" and "Earnings Per Share". This means that these variables have a strong linear relationship with each other.
Below are the positive correlations between variables:
Net Income and Estimated Shares Outstanding:
Net Income and Earnings Per Share:
Earnings_Per_Share and Current_Price:
Below are the negative correlations between variables:
"Volatility" has a negative correlation of -0.38 with both "Net Income" and "Earnings Per Share". This means that higher volatility tends to be associated with lower levels of net income and earnings per share.
Price_Change and Volatility :
Earnings_Per_Share and ROE:
Earnings_Per_Share and Volatility
Net income and Volatility
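Rather than reading pairs off the heatmap by eye, the strongest correlations can be ranked programmatically. A small sketch on a hypothetical mini-frame (in the notebook, `df[numeric_columns]` would take its place):

```python
import pandas as pd

# Hypothetical mini-frame; df[numeric_columns] would be used in the notebook
toy = pd.DataFrame({
    "EPS":        [2, 4, 6, 8, 11],
    "Net Income": [1, 2, 3, 4, 5],
    "Volatility": [9, 7, 6, 4, 2],
})

# Unstack the correlation matrix into pairs, keep each unordered pair once,
# and rank by absolute correlation strength
corr = toy.corr().stack()
pairs = corr[corr.index.get_level_values(0) < corr.index.get_level_values(1)]
print(pairs.reindex(pairs.abs().sort_values(ascending=False).index))
```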
sns.pairplot(data=df[numeric_columns], diag_kind="kde")
plt.show()
Cash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
'''
To calculate the average cash ratio across economic sectors:
1. Group the data by the "GICS Sector" column.
2. Calculate the average cash ratio for each sector.
3. Display the average cash ratio for each sector.
'''
average_cash_ratio_by_sector = df.groupby("GICS Sector")["Cash Ratio"].mean().sort_values(ascending=False)
print(average_cash_ratio_by_sector)
# Cash_Ratio Vs. GICS_Sector
plt.figure(figsize=(15,8))
sns.boxplot(data=df, x='GICS Sector', y='Cash Ratio')
plt.xticks(rotation=90);
Observation and Answer
The cash ratio indicates the ability of a company within a specific sector to cover its short-term obligations using cash and cash equivalents.
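As a reminder of what the metric means, the standard definition divides cash and cash equivalents by current liabilities; a value of 1 or more means short-term obligations are fully covered by cash alone. A tiny illustrative sketch (the figures are made up):

```python
def cash_ratio(cash_and_equivalents, current_liabilities):
    """Cash ratio: the most conservative short-term liquidity measure."""
    return cash_and_equivalents / current_liabilities

# e.g. $50M in cash against $40M of current liabilities (hypothetical figures)
print(cash_ratio(50, 40))  # 1.25: obligations fully covered by cash
```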
P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
average_PE_ratio_by_sector = df.groupby("GICS Sector")["P/E Ratio"].mean().sort_values(ascending=False)
print(average_PE_ratio_by_sector)
plt.figure(figsize=(15,8))
sns.barplot(data=df, x='GICS Sector', y='P/E Ratio', ci=False)
plt.xticks(rotation=90);
Observation and Answer
# Lets get the top 5 GICS sectors.
# you can also use the .head()
top_5_sectors = df['GICS Sector'].value_counts().nlargest(5)
# Now lets get the top 5 GICS sub-industries for each top GICS sector
top_5_sub_industries = []
for sector in top_5_sectors.index:
    sub_industries = df[df['GICS Sector'] == sector]['GICS Sub Industry'].value_counts().nlargest(5)
    top_5_sub_industries.append(sub_industries)
# Display the count of sub-industries for each top GICS sector
for i, sector in enumerate(top_5_sectors.index):
    print(f"Top 5 GICS Sub-Industries in {sector}:")  # f-string formatting
    print(top_5_sub_industries[i])
    print()
# Plot the top 5 GICS sub-industries for each top GICS sector
for i, sector in enumerate(top_5_sectors.index):
    plt.figure(figsize=(10, 6))
    top_5_sub_industries[i].plot(kind='bar', color='green')
    plt.xlabel('GICS Sub Industry')
    plt.ylabel('Count')
    plt.title(f'Top 5 GICS Sub-Industries in {sector}')
    plt.show()
'''
One reason to convert the object dtype to category is to
reduce the memory usage of a DataFrame or a Series,
leading to faster processing times and the ability to
work with larger datasets that might not otherwise fit into memory.
It is also beneficial for machine learning tasks.
cat_columns was declared earlier when we broke down the categorical data.
'''
for i in cat_columns:
    df[i] = df[i].astype("category")
df.drop('Ticker Symbol', axis = 1,inplace = True)
# verify the new dataset again
df.info()
Variable Ticker Symbol has successfully been dropped, and all object dtypes have been converted to the category datatype.
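The memory argument for the category conversion is easy to demonstrate: for a low-cardinality string column like `GICS Sector`, the category dtype stores each distinct label once plus small integer codes. A standalone sketch with made-up sector labels:

```python
import pandas as pd

# Hypothetical low-cardinality column, similar in spirit to "GICS Sector"
s_obj = pd.Series(["Energy", "Financials", "Energy", "Health Care"] * 1000)
s_cat = s_obj.astype("category")

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
print(obj_bytes, cat_bytes)  # the category version is far smaller
```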
# filter the columns of the dataframe with numeric datatypes only
numeric_columns = df.select_dtypes(include=["int64", "float64"]).columns
# Print the list of numerical columns
print(numeric_columns)
print('No of numerical columns are :',len(numeric_columns))
# Let's check for outliers in the data using boxplots
# use the ceiling function to determine the number of subplot rows needed
import math

n_rows = int(math.ceil(len(numeric_columns) / 4))
plt.figure(figsize=(15, n_rows * 4))
for i, variable in enumerate(numeric_columns):
    plt.subplot(n_rows, 4, i + 1)  # n_rows by 4 cols
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout(pad=2)
    plt.title(variable)
plt.show()
# functions to treat outliers by flooring and capping
def treat_outliers(df, col):
    """
    Treats outliers in a variable
    df: dataframe
    col: dataframe column
    """
    Q1 = df[col].quantile(0.25)  # 25th quantile
    Q3 = df[col].quantile(0.75)  # 75th quantile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
    # values greater than Upper_Whisker will be assigned the value of Upper_Whisker
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df


def treat_outliers_all(df, col_list):
    """
    Treat outliers in a list of variables
    df: dataframe
    col_list: list of dataframe columns
    """
    for c in col_list:
        df = treat_outliers(df, c)
    return df
# create a new data frame after treating outliers in the columns
df1 = treat_outliers_all(df, numeric_columns)
# verify outliers have been treated
# outlier detection using boxplot
numeric_columns = df1.select_dtypes(include=["int64", "float64"]).columns
n_rows = int(math.ceil(len(numeric_columns) / 4))
plt.figure(figsize=(15, n_rows * 4))
for i, variable in enumerate(numeric_columns):
    plt.subplot(n_rows, 4, i + 1)  # n_rows by 4 cols
    plt.boxplot(df1[variable], whis=1.5)
    plt.tight_layout(pad=2)
    plt.title(variable)
plt.show()
Outliers have been treated.
Let's scale the data before we proceed with clustering.
scaler = StandardScaler()
subset = df1[numeric_columns].copy()
subset_scaled = scaler.fit_transform(subset)
# Creating a dataframe from the outlier treated scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
#display the first few rows of the subset_scaled DataFrame
subset_scaled_df.head()
# make copy of the treated scaled data set
k_means_df = subset_scaled_df.copy()
clusters = range(1, 15)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(subset_scaled_df)
    prediction = model.predict(k_means_df)
    distortion = (
        sum(np.min(cdist(k_means_df, model.cluster_centers_, "euclidean"), axis=1))
        / k_means_df.shape[0]
    )
    meanDistortions.append(distortion)
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=10)
plt.show()
It appears that the optimal number of clusters (k) may be around 4 or 5.
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(k_means_df) # fit the data to the visualizer
visualizer.show(); # finalize and render figure
According to the KElbow Visualizer, the appropriate value of k = 5
sil_score = []
cluster_list = range(2, 15)
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters, random_state=1)
    preds = clusterer.fit_predict(subset_scaled_df)
    score = silhouette_score(k_means_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.show()
Higher silhouette scores indicate better-defined and well-separated clusters.
Based on the provided silhouette scores for different numbers of clusters, the highest score is achieved when the number of clusters is 3 (silhouette score = 0.1818864171381463)
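For intuition on the metric itself: with clearly separated groups the silhouette score approaches 1, which puts the ~0.18 obtained here in perspective. A self-contained toy sketch (random blobs, unrelated to the stock data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated toy blobs
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(30, 2), rng.randn(30, 2) + [8, 8]])

labels = KMeans(n_clusters=2, random_state=1, n_init=10).fit_predict(X)
score = silhouette_score(X, labels)
print(round(score, 2))  # close to 1 for clean separation
```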
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 15), metric="silhouette", timings=True)
visualizer.fit(k_means_df) # fit the data to the visualizer
visualizer.show(); # finalize and render figure
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(k_means_df)
visualizer.show();
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(k_means_df)
visualizer.show();
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(k_means_df)
visualizer.show();
Higher average silhouette scores indicate better cluster quality, i.e. better-defined and well-separated clusters.
Conclusion for Final Model Cluster selection (n_clusters):
The highest silhouette score among the tested number of clusters was achieved when the number of clusters was set to 3, indicating that the data points were relatively well-clustered with this configuration.
Additionally, the KElbow visualizer run with the silhouette metric also suggested that the optimal number of clusters is 3.
Hence, 3 seems to be a good value of k for the final model.
%%time
# final K-means model
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(k_means_df)
# creating a copy of the original data
df2 = df.copy()
# adding kmeans cluster labels to the original and scaled dataframes
k_means_df["K_means_segments"] = kmeans.labels_
df2["K_means_segments"] = kmeans.labels_
km_cluster_profile = df2.groupby("K_means_segments").mean(numeric_only=True)  # mean of the numeric columns per cluster
#count the number of observations or securities within each cluster segment.
km_cluster_profile["count_in_each_segment"] = (
df2.groupby("K_means_segments")["Security"].count().values
)
#highlight the maximum value in each column of the dataFrame km_cluster_profile
km_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
Observation
Based on the highlighted maximum values in each cluster segment, in conjunction with the 3 clusters selected in the final model:
Cluster Segment 0:
Cluster Segment 1:
Cluster Segment 2:
# simple print and count the companies in each cluster
cluster_counts = []
for cl in df2["K_means_segments"].unique():
    companies = df2[df2["K_means_segments"] == cl]["Security"].unique()
    count = len(companies)
    cluster_counts.append([cl, count, ", ".join(companies)])
print(tabulate(cluster_counts, headers=["Cluster", "Count", "Companies(Security)"]))
# to print, segregate by GICS sector, and count the companies in each cluster
from tabulate import tabulate
cluster_counts = []
for cl in df2["K_means_segments"].unique():
    cluster_data = []
    for sector in df2["GICS Sector"].unique():
        companies = df2[(df2["K_means_segments"] == cl) & (df2["GICS Sector"] == sector)]["Security"].unique()
        count = len(companies)
        cluster_data.append((sector, count, ", ".join(companies)))
    cluster_counts.append([cl, cluster_data])
headers = ["Cluster", "Sector", "Count", "Companies(Security)"]
table_data = []
for cluster in cluster_counts:
    cl = cluster[0]
    data = cluster[1]
    for entry in data:
        sector, count, companies = entry
        table_data.append([cl, sector, count, companies])
print(tabulate(table_data, headers=headers))
# to simply print the GICS sector in a given segment (without the company names)
df2.groupby(["K_means_segments", "GICS Sector"])['Security'].count()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
# Selecting numerical columns
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
# Determine the number of rows and columns for the subplots
num_plots = len(numeric_columns)
num_rows = num_plots // 4 + (num_plots % 4 > 0)
num_cols = min(num_plots, 4)
for i, variable in enumerate(numeric_columns):
    plt.subplot(num_rows, num_cols, i + 1)
    sns.boxplot(data=df2, x="K_means_segments", y=variable)
    plt.xlabel("Cluster")
    plt.ylabel(variable)
    plt.tight_layout(pad=2.0)
plt.show()
From the cluster grouping tabulation, we can deduce the following insights:
Cluster 0:
Cluster 1:
Cluster 2:
(measures the correlation between the pairwise distances of the original data points and the pairwise distances of the clustered data points)
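A minimal, self-contained illustration of the coefficient (toy points, not the stock data): two tight pairs far apart produce a tree whose cophenetic distances track the original pairwise distances almost perfectly.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

# Toy points: two tight pairs far from each other
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

Z = linkage(X, metric="euclidean", method="average")
coph_corr, coph_dists = cophenet(Z, pdist(X))
print(round(coph_corr, 2))  # near 1: the tree preserves the original distances
```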
# make a copy of the scaled data frame (outlier treated scaled dataframe)
hc_df = subset_scaled_df.copy()
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(hc_df, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(hc_df))
        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm, c
            )
        )
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print('*'*100)
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Observation
Based on these coefficients:
Cityblock distance and average linkage also exhibit relatively higher Cophenetic correlation coefficients.
This suggests that these methods preserve the original distances in the data more effectively compared to other combinations of distance metrics and linkage methods.
Let's explore different linkage methods with Euclidean distance only.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
    Z = linkage(hc_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df))
    print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]
)
)
Observation
Based on these results, we can observe that average linkage and centroid linkage methods have relatively higher Cophenetic correlation coefficients
Let's view the dendrograms for the different linkage methods with Euclidean distance.
%%time
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
compare = [] #save the results of the cophenetic correlation calculation
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
    Z = linkage(hc_df, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(hc_df))
    axs[i].annotate(
        f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
        (0.80, 0.80),
        xycoords="axes fraction",
    )
    compare.append([method, coph_corr])
# create and print a dataframe to compare cophenetic correlations for different linkage methods
df_cc = pd.DataFrame(compare, columns=compare_cols)
df_cc = df_cc.sort_values(by="Cophenetic Coefficient")
df_cc
Known Interpretation
Unknown Interpretation
Student Note: In the dendrogram, the x-axis contains all the samples in the dataset and the y-axis represents the distance between these samples.
%%time
HCmodel = AgglomerativeClustering(n_clusters=4, affinity="euclidean", linkage="average")
HCmodel.fit(hc_df)
# creating a copy of the original data
df3 = df.copy()
# adding hierarchical cluster labels to the original and scaled dataframes
hc_df["HC_Clusters"] = HCmodel.labels_
df3["HC_Clusters"] = HCmodel.labels_
# calculates the mean value of the numeric columns for each group
hc_cluster_profile = df3.groupby("HC_Clusters").mean(numeric_only=True)
hc_cluster_profile["count_in_each_segments"] = (
df3.groupby("HC_Clusters")["Security"].count().values
)
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0)
Based on the highlighted maximum mean values in each cluster segment, in conjunction with the 4 clusters selected in the final HC model:
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
# simple print and count the companies in each cluster
from tabulate import tabulate
cluster_counts = []
for cl in df3["HC_Clusters"].unique():
    companies = df3[df3["HC_Clusters"] == cl]["Security"].unique()
    count = len(companies)
    cluster_counts.append([cl, count, ", ".join(companies)])
print(tabulate(cluster_counts, headers=["Cluster", "Count", "Companies"]))
# to print, segregate by GICS sector, and count the companies in each cluster
from tabulate import tabulate
cluster_counts = []
for cl in df3["HC_Clusters"].unique():
    cluster_data = []
    for sector in df3["GICS Sector"].unique():
        companies = df3[(df3["HC_Clusters"] == cl) & (df3["GICS Sector"] == sector)]["Security"].unique()
        count = len(companies)
        cluster_data.append((sector, count, ", ".join(companies)))
    cluster_counts.append([cl, cluster_data])
headers = ["Cluster", "Sector", "Count", "Companies"]
table_data = []
for cluster in cluster_counts:
    cl = cluster[0]
    data = cluster[1]
    for entry in data:
        sector, count, companies = entry
        table_data.append([cl, sector, count, companies])
print(tabulate(table_data, headers=headers))
# to simply print the GICS sector in a given segment (without the company names)
df3.groupby(["HC_Clusters", "GICS Sector"])['Security'].count()
plt.figure(figsize=(20, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(numeric_columns):
    plt.subplot(3, 4, i + 1)
    sns.boxplot(data=df3, x="HC_Clusters", y=variable)
    plt.tight_layout(pad=2.0)
a) Cluster 0 shows a diverse representation of sectors and has the highest count of companies, with 291 companies. This indicates that a significant portion of the dataset belongs to this cluster.
b) In terms of sector distribution or concentration of companies within Cluster 0, the top 3 represented sectors are Industrials (53 companies), Financials (49 companies), and Health Care (32 companies).
c) Clusters 1 and 2 have far fewer companies, 17 and 31 respectively, compared to Cluster 0.
d) The top 3 sectors in Cluster 1 are Health Care (8), Information Technology (5), and Consumer Discretionary (3). The top 3 sectors in Cluster 2 are Energy (24), Information Technology (4), and Materials (2).
e) Cluster 3 only includes 1 sector: Consumer Discretionary (Food & Beverage).
f) Cluster 2 shows a significant representation of the Energy sector, with 24 companies. This indicates a higher concentration of energy-related companies within Cluster 2 compared to other sectors.
g) Cluster 1 is characterized by having the highest count of companies in the Health Care sector
h) Clusters with 0 companies indicate that there were no companies that fit the criteria or met the similarity thresholds to be assigned to those clusters.
Comparison across several aspects:
Q1. Which clustering technique took less time for execution?
Q2. Which clustering technique gave you more distinct clusters, or are they the same?
Q3. How many observations are there in the similar clusters of both algorithms?
Q4. How many clusters are obtained as the appropriate number of clusters from both algorithms?
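One number that complements the manual comparison below is the adjusted Rand index, which scores the agreement between two labelings regardless of the label names. This is an extra check, not part of the original analysis; the label vectors here are hypothetical stand-ins for `df2["K_means_segments"]` and `df3["HC_Clusters"]`:

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical stand-ins for the K-means and HC label columns
kmeans_labels = [0, 0, 1, 1, 2, 2]
hc_labels     = [1, 1, 0, 0, 3, 3]  # same grouping, different label names

print(adjusted_rand_score(kmeans_labels, hc_labels))  # 1.0: identical partitions
```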
# Q3. How many observations are there in the similar clusters of both algorithms?
# this code is for selected sector analysis only. The cross tab provides further analysis.
kmeans_cluster_companies = {}
for cl in df2["K_means_segments"].unique():
    companies = df2[df2["K_means_segments"] == cl]["Security"].unique()
    kmeans_cluster_companies[cl] = set(companies)

hc_cluster_companies = {}
for cl in df3["HC_Clusters"].unique():
    companies = df3[df3["HC_Clusters"] == cl]["Security"].unique()
    hc_cluster_companies[cl] = set(companies)
similar_clusters = []
kmeans_unique_clusters = []
hc_unique_clusters = []
for kmeans_cluster, kmeans_companies in kmeans_cluster_companies.items():
    for hc_cluster, hc_companies in hc_cluster_companies.items():
        if kmeans_companies == hc_companies:
            similar_clusters.append((kmeans_cluster, hc_cluster))
    if kmeans_cluster not in [x[0] for x in similar_clusters]:
        kmeans_unique_clusters.append(kmeans_cluster)
for hc_cluster, hc_companies in hc_cluster_companies.items():
    if hc_cluster not in [x[1] for x in similar_clusters]:
        hc_unique_clusters.append(hc_cluster)
print("Similar Clusters:")
for kmeans_cluster, hc_cluster in similar_clusters:
    print("KMeans Cluster:", kmeans_cluster)
    print("HC Cluster:", hc_cluster)
    print()

print("KMeans Unique Clusters:")
for kmeans_cluster in kmeans_unique_clusters:
    print("KMeans Cluster:", kmeans_cluster)
print()

print("HC Unique Clusters:")
for hc_cluster in hc_unique_clusters:
    print("HC Cluster:", hc_cluster)
print()
# Showing the specific cases mentioned
kmeans_cluster0_companies = kmeans_cluster_companies[0]
hc_cluster0_companies = hc_cluster_companies[0]
hc_cluster0_unique_companies = hc_cluster_companies[0].difference(kmeans_cluster0_companies)
kmeans_cluster2_companies = kmeans_cluster_companies[2]
hc_cluster3_companies = hc_cluster_companies[3]
hc_cluster3_unique_companies = hc_cluster_companies[3].difference(kmeans_cluster2_companies)
print("Similar Companies:")
print("Cluster 0 in KMeans algorithm:", sorted(kmeans_cluster0_companies))
print("Cluster 0 in Hierarchical Clustering Agglomerative algorithm:", sorted(hc_cluster0_companies))
print("Unique companies in Cluster 0 Hierarchical Clustering Agglomerative algorithm:", sorted(hc_cluster0_unique_companies))
print()
print("Cluster 2 in KMeans algorithm:", sorted(kmeans_cluster2_companies))
print("Cluster 3 in Hierarchical Clustering Agglomerative algorithm:", sorted(hc_cluster3_companies))
print("Unique companies in Cluster 3 Hierarchical Clustering Agglomerative algorithm:", sorted(hc_cluster3_unique_companies))
pd.crosstab(df2['GICS Sector'], [df2['K_means_segments'], df3['HC_Clusters']]).style.highlight_max(color='lightgreen', axis=0)
pd.crosstab(df2['Security'], [df2['K_means_segments'], df3['HC_Clusters']]).style.highlight_max(color='lightgreen', axis=0)
**A. INSIGHTS FROM EDA**
**INSIGHTS FROM CLUSTER ANALYSIS**
We have two clustering algorithms: K-means and Hierarchical Agglomerative Clustering (HC). The following are key observations and insights for each algorithm:
The companies in this segment are primarily from the Financials sector, followed by Industrials and Information Technology.
Cluster Segment 1: This segment has the highest current prices and the highest count of securities. It includes companies from various sectors, indicating a more balanced distribution across industries.
Cluster Segment 2: This segment represents securities with higher volatility in stock prices, high cash ratios, and relatively higher P/E and P/B ratios. Energy sector companies dominate this cluster.
Cluster 0: This cluster has moderate current prices, relatively low volatility, positive price changes, and diverse representation from different sectors. It has the highest net income among all clusters and the highest count of companies.
Cluster 1: This cluster has the highest price changes and net cash flow. It also has a significant net income, indicating healthy profitability.
Cluster 2: This cluster has lower current prices, negative price changes, higher volatility, and the highest ROE, estimated shares outstanding, and P/E ratio among all clusters. It includes companies from the Energy sector.
Cluster 3: This cluster has the highest current prices, volatility, cash ratio, EPS, and P/B ratio. It represents a single company, Chipotle Mexican Grill, with unique characteristics.
K-means Cluster Grouping: Cluster 0 has a diverse mix of sectors, with notable representation from Financials, Industrials, and Information Technology. Cluster 1 shows a more balanced distribution across sectors, while Cluster 2 is dominated by Energy companies.
HC Cluster Grouping: Cluster 0 has the highest count of companies and diverse sector representation, similar to K-means Cluster 0. Cluster 1 has fewer sectors represented, and Cluster 2 is primarily composed of Energy companies. Cluster 3 represents a single company from Consumer Discretionary sector, Chipotle Mexican Grill.
B. RECOMMENDATIONS
Here are some recommendations, suggestions, and advice for Trade&Ahead:
1. Utilize K-means Clustering:
2. Analyze the distribution:
3. Assess Risk and Valuation:
4. Explore Hierarchical Clustering:
5. Regular Monitoring and Update Clusters: